Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation

Authors: Pengfei Zuo (Huawei) & Zhi Zhou (SYSU)

Link: [2503.20552] Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation

Abstract: In large language model (LLM) serving systems, executing each request consists of two phases: the compute-intensive prefill phase and the memory-intensive decoding phase. To prevent performance interference between the two phases, current LLM serving systems typically adopt prefill-decoding disaggregation, where the two phases are split across separate machines. However, we observe this approach leads to significant resource underutilization. Specifically, prefill instances that are compute-intensive suffer from low memory utilization, while decoding instances that are memory-intensive experience low compute utilization. To address this problem, this paper proposes Adrenaline, an attention disaggregation and offloading mechanism designed to enhance resource utilization and performance in LLM serving systems. Adrenaline's key innovation lies in disaggregating part of the attention computation in the decoding phase and offloading them to prefill instances. The memory-bound nature of decoding-phase attention computation inherently enables an effective offloading strategy, yielding two complementary advantages: 1) improved memory capacity and bandwidth utilization in prefill instances, and 2) increased decoding batch sizes that enhance compute utilization in decoding instances, collectively boosting overall system performance. Adrenaline achieves these gains through three key techniques: low-latency decoding synchronization, resource-efficient prefill colocation, and load-aware offloading scheduling. Experimental results show that Adrenaline achieves 2.28x higher memory capacity and 2.07x better memory bandwidth utilization in prefill instances, up to 1.67x improvements in compute utilization for decoding instances, and 1.68x higher overall inference throughput compared to state-of-the-art systems.

Motivation

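Prefill: as the prompt length grows, compute and memory-bandwidth utilization level off. Compute utilization is high, while memory-bandwidth utilization stays low (below 30%).

Decode: as the batch size grows, compute and memory-bandwidth utilization also level off. Compute utilization is low (below 26%), while memory-bandwidth utilization is high.

Prefill is compute-bound.

Decode is, in general, memory-bound.

HBM occupancy is low on prefill nodes (below 21%) and high on decode nodes, so resource usage is unbalanced.

As the batch size grows, attention's share of decode time rises to 69.5%, making attention the largest consumer of HBM capacity and memory bandwidth.

Hence the proposal: disaggregate part of the decode-phase attention and hand it to the prefill nodes to compute.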
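A back-of-the-envelope roofline estimate helps explain why decode attention is the natural piece to offload. The sketch below is an illustration only, not taken from the paper: it assumes FP16 tensors, counts only KV-cache reads for attention and only weight reads for a linear layer, and ignores softmax and activation traffic. The function names are hypothetical.

```python
def decode_attention_intensity(seq_len: int, head_dim: int,
                               bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for one attention head decoding one token of one request."""
    # q @ K^T and probs @ V each cost about 2 * seq_len * head_dim FLOPs.
    flops = 4 * seq_len * head_dim
    # The K cache and the V cache are each read once.
    bytes_moved = 2 * seq_len * head_dim * bytes_per_elem
    return flops / bytes_moved


def linear_layer_intensity(batch_size: int, d_in: int, d_out: int,
                           bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for a linear layer applied to one token per request."""
    flops = 2 * batch_size * d_in * d_out
    # Assumption: the weight matrix read dominates memory traffic.
    bytes_moved = d_in * d_out * bytes_per_elem
    return flops / bytes_moved


if __name__ == "__main__":
    print(decode_attention_intensity(seq_len=1024, head_dim=128))
    print(linear_layer_intensity(batch_size=64, d_in=4096, d_out=4096))
```

Under these assumptions the attention intensity does not depend on batch size, because every request reads its own KV cache, whereas the linear-layer intensity grows with the batch because the weights are shared. That is consistent with the notes above: batching helps the linear layers use compute, but attention stays memory-bound.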

Challenges

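Attention synchronization overhead: the offloaded attention must finish computing and synchronizing within a single decode attention step (under 1 ms); otherwise it stalls the attention computation of every layer.
1. Shorten the critical path of synchronization
2. Use CUDA Graphs to cut the extra kernel launch overhead
Performance interference on prefill nodes: the offloaded attention contends with prefill for HBM and compute.
1. Explore compute-constrained decode attention
2. Introduce a resource-efficient prefill colocation strategy
Offloading rate control: offloading too much overloads prefill nodes with extra computation, while offloading too little yields gains that do not cover the synchronization overhead.
1. Introduce a dynamic control policy based on real-time compute and memory resource utilization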

Design

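I. Low-latency Decoding Synchronization on Decode Nodes

1. Offloading workflow

If the offloaded attention takes longer than the attention running locally on the decode node, the decode node stalls.

Block allocation is moved off the critical path.

q, k, and v are grouped and sent together.

The execution time of the offloaded attention is kept under control.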
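The stall condition above can be written down as a small timing model. This is a minimal sketch under stated assumptions: the offloaded path is modeled as transfer time plus remote attention time, it overlaps fully with local attention, and the function name and the example timings are hypothetical rather than measurements from the paper.

```python
def per_layer_stall_ms(local_attn_ms: float, remote_attn_ms: float,
                       transfer_ms: float) -> float:
    """Time the decode node waits per layer for the offloaded attention result."""
    # Offloaded path: ship q/k/v, run attention remotely, return the output.
    offload_path_ms = transfer_ms + remote_attn_ms
    # The decode node only waits if the offloaded path outlasts its local attention.
    return max(0.0, offload_path_ms - local_attn_ms)


if __name__ == "__main__":
    # Offloaded path hidden behind local attention: no stall.
    print(per_layer_stall_ms(local_attn_ms=0.30, remote_attn_ms=0.20, transfer_ms=0.05))
    # Offloaded path is slower: every layer pays the difference.
    print(per_layer_stall_ms(local_attn_ms=0.30, remote_attn_ms=0.40, transfer_ms=0.05))
```

Because the wait is paid once per transformer layer, even a small per-layer stall is multiplied by the layer count in every decode step, which is why the three measures listed above all aim at shrinking the offloaded path.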

2. Kernel launch overhead

For example, when executing the decoding phase of Llama-2 7B using one A100 (batch size is 8 and the sequence length is 1K), the average GPU time per transformer layer is 0.38 ms but the average CPU time is 1.137 ms, thus wasting 0.76 ms GPU time per transformer layer for CPU overhead.

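vLLM uses one-dimensional CUDA Graphs indexed by batch size only; the graph matching the current batch size is picked at each iteration.

Adrenaline uses two-dimensional CUDA Graphs, indexed by both batch size and the number of offloaded tasks; the captured CUDA Graph is sent to the other machine at step 1.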
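The two-dimensional indexing can be pictured as a cache keyed by a (batch size, offloaded tasks) pair. The sketch below is an assumption-laden illustration, not Adrenaline's or vLLM's code: `GraphCache2D`, its bucket lists, and `capture_fn` are hypothetical, and `capture_fn` stands in for whatever actually records a graph (for example, code built around `torch.cuda.CUDAGraph`). Rounding up to a pre-chosen bucket is also an assumption, used here to keep the number of captured graphs small.

```python
from bisect import bisect_left


class GraphCache2D:
    """Cache of captured graphs keyed by (batch bucket, offload bucket)."""

    def __init__(self, batch_buckets, offload_buckets):
        self.batch_buckets = sorted(batch_buckets)
        self.offload_buckets = sorted(offload_buckets)
        self.graphs = {}

    @staticmethod
    def _round_up(buckets, value):
        # Smallest bucket that is >= value, so the padded shape fits the request.
        i = bisect_left(buckets, value)
        if i == len(buckets):
            raise ValueError(f"{value} exceeds the largest bucket {buckets[-1]}")
        return buckets[i]

    def key_for(self, batch_size, num_offloaded):
        return (self._round_up(self.batch_buckets, batch_size),
                self._round_up(self.offload_buckets, num_offloaded))

    def get_or_capture(self, batch_size, num_offloaded, capture_fn):
        key = self.key_for(batch_size, num_offloaded)
        if key not in self.graphs:
            # Capture once per bucket pair; later steps replay the stored graph.
            self.graphs[key] = capture_fn(*key)
        return self.graphs[key]


if __name__ == "__main__":
    cache = GraphCache2D(batch_buckets=[1, 2, 4, 8], offload_buckets=[0, 2, 4])
    fake_capture = lambda b, o: f"graph(batch={b}, offloaded={o})"
    print(cache.get_or_capture(3, 1, fake_capture))
```

A one-dimensional cache would map batch sizes 3 and 4 to the same graph regardless of how many requests are offloaded; the second key lets the graph shape differ when some attention work runs elsewhere.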

II. Colocation on Prefill Nodes

1. Resource profiling

Attention memory bandwidth vs. SM ratio: With the increase of used SMs, the HBM bandwidth utilization of the attention computation kernel super-linearly increases.
Prefill latency vs. SM ratio: With the decrease of used SMs, the prefill latency increases at a sub-linear level.

2. Performance isolation

The key idea is to partition SMs according to the TTFT SLO.
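One way to read "partition SMs according to the TTFT SLO" is: give prefill the smallest SM share that still meets the SLO and leave the rest to the offloaded attention. The sketch below assumes an offline profile of prefill latency per SM fraction is available; the function name and the profile numbers are hypothetical and are not measurements from the paper.

```python
def pick_prefill_sm_fraction(profile, ttft_slo_ms):
    """Return the smallest profiled SM fraction whose prefill latency meets the SLO.

    profile: iterable of (sm_fraction, prefill_latency_ms) pairs from offline profiling.
    Returns None if no profiled fraction satisfies the SLO.
    """
    feasible = [frac for frac, latency_ms in profile if latency_ms <= ttft_slo_ms]
    return min(feasible) if feasible else None


if __name__ == "__main__":
    # Hypothetical profile: latency grows sub-linearly as the SM share shrinks.
    profile = [(0.4, 310.0), (0.6, 220.0), (0.8, 180.0), (1.0, 160.0)]
    prefill_share = pick_prefill_sm_fraction(profile, ttft_slo_ms=250.0)
    print("prefill SM share:", prefill_share)
    print("SM share left for offloaded attention:", round(1.0 - prefill_share, 2))
```

The two profiling observations are what make this split attractive: prefill loses little latency when it gives up some SMs, while attention reaches a large share of HBM bandwidth with only a few SMs.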

III. Load-aware Offloading Scheduling

Two scheduling questions need to be answered:

How many decoding attention tasks can be offloaded?
Given an ideal offloading ratio, how to efficiently determine whether offloading is necessary for a request in a dynamic workload?

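1. Bounding the offloading ratio

How much attention can be offloaded at most without hurting TPOT? Adrenaline bounds the offloading ratio from two perspectives:

Offloaded attention on prefill nodes: limited by the memory capacity and memory bandwidth of the prefill nodes.
Linear-layer computation on decode nodes: limited by the TPOT requirement of the decode nodes, where Bmax denotes the maximum batch size with offloading enabled.

The final upper bound is therefore the tightest of these limits.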
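Combining the limits can be sketched as taking a minimum. This is only an illustration of that reasoning, not the paper's formula: it assumes every limit is expressed as a number of requests of similar length, and the function and parameter names other than Bmax (`b_max`) are hypothetical.

```python
def offload_upper_bound(prefill_mem_limit: int, prefill_bw_limit: int,
                        b_max: int, local_batch: int) -> int:
    """Largest number of requests whose attention may be offloaded (illustrative).

    prefill_mem_limit: requests whose KV cache fits in spare prefill-node HBM.
    prefill_bw_limit:  requests the prefill node's spare bandwidth can serve in time.
    b_max:             largest total decode batch that still meets TPOT.
    local_batch:       requests whose attention stays on the decode node.
    """
    # Prefill side: both the memory and the bandwidth limit must hold.
    prefill_side = min(prefill_mem_limit, prefill_bw_limit)
    # Decode side: offloaded requests fill the gap between local batch and b_max.
    decode_side = max(0, b_max - local_batch)
    return min(prefill_side, decode_side)


if __name__ == "__main__":
    print(offload_upper_bound(prefill_mem_limit=30, prefill_bw_limit=50,
                              b_max=100, local_batch=40))
```

In the example the prefill node's memory is the binding limit; with a smaller `b_max`, the decode node's TPOT budget would become the bottleneck instead.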

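2. Fine-grained adaptive scheduling

Offloading too many requests to the prefill nodes interferes with prefill and adds considerable synchronization overhead to decoding.

A new request is offloaded only if the upper bound still holds even when the request runs to its maximum generation length.
The batch size and sequence lengths of the requests already offloaded must not exceed the upper bound.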
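The two checks above amount to an admission test with worst-case reservation. The sketch below is a minimal illustration under assumptions: the upper bound is given as a request count plus a token budget, each admitted request reserves its prompt plus maximum output length, and `OffloadState` and `try_offload` are hypothetical names rather than Adrenaline's API.

```python
from dataclasses import dataclass


@dataclass
class OffloadState:
    max_requests: int      # upper bound on the offloaded batch size
    max_tokens: int        # upper bound on the total offloaded sequence length
    cur_requests: int = 0
    cur_tokens: int = 0    # tokens reserved so far, at worst-case length


def try_offload(state: OffloadState, prompt_len: int, max_new_tokens: int) -> bool:
    """Admit a request for offloading only if both bounds hold in the worst case."""
    worst_case_len = prompt_len + max_new_tokens
    if state.cur_requests + 1 > state.max_requests:
        return False
    if state.cur_tokens + worst_case_len > state.max_tokens:
        return False
    # Reserve the worst case up front so later growth cannot break the bound.
    state.cur_requests += 1
    state.cur_tokens += worst_case_len
    return True


if __name__ == "__main__":
    state = OffloadState(max_requests=2, max_tokens=4096)
    print(try_offload(state, prompt_len=1000, max_new_tokens=1000))
    print(try_offload(state, prompt_len=1500, max_new_tokens=1000))
```

Requests that fail the test simply keep their attention on the decode node. A complete scheduler would also release the reservation when a request finishes; that step is left out here for brevity.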
